Abstract

Studies of the contextual and linguistic factors that constrain discourse phenomena such as reference are 
coming to depend increasingly on annotated language corpora. In preparing the corpora, it is important 
to evaluate the reliability of the annotation, but methods for doing so have not been readily available. In 
this report, I present a method for computing reliability of coreference annotation. First I review a method 
for applying the information retrieval metrics of recall and precision to coreference annotation proposed by 
Marc Vilain and his collaborators. I show how this method makes it possible to construct contingency tables 
for computing Cohen's κ, a familiar reliability metric. By comparing recall and precision to reliability on the
same data sets, I also show that recall and precision can be misleadingly high. Because κ factors out chance
agreement among coders, it is a preferable measure for developing annotated corpora where no pre-existing 
target annotation exists. 



1 Two Reliability Metrics 

Two equivalent metrics for quantifying inter-rater reliability between pairs of coders are Cohen's κ coefficient of agreement (1960) and Krippendorff's α (1980). The formulas for each are shown in (1) and (2).

\[ \kappa = \frac{p_{A_O} - p_{A_E}}{1 - p_{A_E}} \tag{1} \]
\[ \alpha = 1 - \frac{p_{D_O}}{p_{D_E}} \tag{2} \]

Briefly, Cohen's κ is cast in terms of the amount of agreement between coders that exceeds chance expectations. The numerator of the ratio in (1) is the proportion of observed agreements (p_AO) less the proportion expected to agree by chance (p_AE); the denominator is the total proportion (100%) less the proportion expected to agree by chance. Conversely, Krippendorff's α is cast in terms of the extent to which the observed disagreements between coders are below chance expectation; it is the total probability less the ratio of observed disagreements (p_DO) to expected disagreements (p_DE). The observed probabilities of agreement and disagreement must sum to one, as must the expected probabilities of agreement and disagreement ((3) and (4)). By substitution, it can be shown that κ equals α ((5) - (8)).
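The substitution can be sketched as follows. This is a reconstruction from the prose definitions above, not a copy of the original numbered steps; it assumes only that observed agreement and disagreement sum to one (p_DO = 1 - p_AO) and that expected agreement and disagreement sum to one (p_DE = 1 - p_AE).

```latex
\begin{align*}
\alpha &= 1 - \frac{p_{D_O}}{p_{D_E}}
        = 1 - \frac{1 - p_{A_O}}{1 - p_{A_E}} \\
       &= \frac{(1 - p_{A_E}) - (1 - p_{A_O})}{1 - p_{A_E}}
        = \frac{p_{A_O} - p_{A_E}}{1 - p_{A_E}}
        = \kappa
\end{align*}
```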

The reliability measures depend crucially on a hypothesis of chance expectation. In (Cohen, 1960) and (Krippendorff, 1980), chance expectation is derived from the marginals of a coincidence matrix classifying the response categories of one coder by the response categories of another coder. Table 1 illustrates a simple 2-by-2 coincidence matrix. A coincidence matrix classifies a set of data in a way that shows, for a given set of classification categories (e.g., A versus B), how the data is cross-classified. Every data point must go in one and only one cell of the table to indicate how the data classified by one coding (row categories) is cross-classified by the other coding (column categories). The diagonal from upper left to lower right in Table 1 represents the responses of judge X that coincide with judge Y's; cells off the diagonal represent classification disagreements.

Table 1: A 2-by-2 coincidence matrix

The marginals in Table 1 show that 61% of judge X's responses are in category A compared with 57% of Y's. Where .61 is taken to be the likelihood that X responds in category A, and .57 the likelihood that Y responds in category A, then .61 x .57 of the time X and Y should agree that the same data point is classified in category A, assuming nothing more than chance correspondence between X and Y's responses. Adding the corresponding likelihood of agreement on response B (.39 x .43) yields p_AE = 52%. The expected proportion of disagreement is similarly computed. By chance, X should respond A where Y responds B 26% of the time (.61 x .43). The difference between the observed agreements (.47 + .29 = .76) and the expected agreements (.52), taken as a proportion of the agreement not expected by chance (1 - .52), results in a reliability value of .50, as shown in (9) of Table 1.
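The arithmetic in this paragraph can be checked with a short sketch. The function name `cohen_kappa` is illustrative, and the two off-diagonal cells (.14 and .10) are not quoted in the text; they are derived here from the stated marginals (.61 - .47 and .57 - .47), so they should be read as an assumption about the body of Table 1.

```python
def cohen_kappa(table):
    """Return (p_ao, p_ae, kappa) for a square coincidence matrix.

    `table[i][j]` holds the count or proportion of items that the row
    coder put in category i and the column coder put in category j.
    """
    n = len(table)
    total = sum(sum(row) for row in table)
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(table[i][j] for i in range(n)) / total for j in range(n)]
    # Observed agreement: the diagonal cells.
    p_ao = sum(table[i][i] for i in range(n)) / total
    # Expected agreement: product of the marginals, summed over categories.
    p_ae = sum(row_marg[i] * col_marg[i] for i in range(n))
    return p_ao, p_ae, (p_ao - p_ae) / (1 - p_ae)


# Rows: judge X (A, B); columns: judge Y (A, B).
# Off-diagonal cells are derived from the marginals quoted in the text.
table1 = [[0.47, 0.14],
          [0.10, 0.29]]

p_ao, p_ae, kappa = cohen_kappa(table1)
print(round(p_ao, 2), round(p_ae, 2), round(kappa, 2))
```

Because the function works on any square matrix, the same sketch applies unchanged to count data.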

Whenever the responses of two subjects can be cast in the form of a coincidence matrix, the reliability metrics illustrated above can be applied. Here I present a proposal for applying reliability to coreference annotation, based on the insights in (Vilain et al., 1995).



2 Evaluating Coreference 
Annotations 

Co-reference annotation is annotation of lan- 
guage data to indicate when distinct expres- 
sions have been used to corefer. Evaluating 
the reliability of such data is important for 
several reasons. First, any annotation task 
is subject to unintended errors arising from 
lack of attention on the part of the annota- 
tor. The likelihood of such errors depends in 
part on ergonomic factors such as what kinds 
of aids are provided for recording and check- 
ing annotations, and how much time the an- 
notator has to perform the task. In addi- 
tion, no matter how precise a language user 
might be, language interpretation is subjec- 
tive. A given expression can be referentially 
ambiguous or vague. Referential indetermi- 
nacy can even be intentional on the part of 
the speaker or writer. When annotations of 
the same data are collected from two or more 
coders, then in principle, the reliability of 
the data (or of the individual coders) can be 
quantified. 

Two language samples are presented in Figure 1 that typify two quite different types of discourse. Sample 1 illustrates journalistic text, and is taken from the Brown Corpus (Francis and Kucera, 1982). Sample 2, illustrating spoken dialogue, is from the University of Rochester's Trains 91 corpus (Gross et al., 1993). Two samples are shown to illustrate that despite major differences of language variety, the task of coreference annotation is essentially the same for both types of data. Both samples have been annotated to indicate certain expressions that have been interpreted to corefer (how or why these particular expressions were selected is immaterial to the present discussion). Relevant phrases have been bracketed. Bracketed phrases that have been annotated with the same numeric subscript represent expressions that, in the annotator's judgement, were used to corefer. For sample 1, eight expressions (A-H) were annotated as referring to one of three distinct referents. The coding of co-referential expressions is shown under column CA (Coreference Annotation). For sample 2, ten expressions (A-J) were annotated as referring to one of three distinct referents, whose indices are listed under the column headed CA1. An alternate coding is shown in column CA2. The remainder of the discussion will focus on sample 2.

Sample 1: Journalistic Text
Committee approval of [Gov. Price Daniel's abandoned property act]_1 seemed certain Thursday despite the adamant protests of Texas bankers. [Daniel]_2 personally led the fight for [the measure]_1, which [he]_2 had watered down considerably since its rejection by two previous Legislatures, in a public hearing before the House Committee on Revenue and Taxation. Under committee rules, [it]_1 went automatically to a subcommittee for one week. But questions with which [committee members]_3 taunted bankers appearing as witnesses left little doubt that [they]_3 will recommend passage of [it]_1.

Sample 2: Problem-Solving Dialogue
M: okay we need to ship a boxcar of oranges to Bath by 8 AM today S: okay M: umm okay so I guess uh I would suggest that we use [engine E1]_1 uh and have [it]_1 pick up [a boxcar]_2 at ah Dansville how long'll [it]_1 take S: uh that'll take 3 hours to get to Dansville and get [the boxcar]_2 M: uh okay and then how long to go on to .. Corning with [the boxcar]_2 coupled to uh [E1]_1 S: another hour M: ok so that's okay and then uh if we loaded [the oranges]_3 at ah Corning and sent ah [E1]_1 on to Bath with [the oranges]_3 S: we'd get there at 7

Figure 1: Co-reference annotation of two language samples

How can a comparison of the two annotations of sample 2, CA1 and CA2, be quantified? The key observations used in (Vilain et al., 1995) are that the sets of expressions that corefer constitute equivalence classes, and that in two annotations, a given expression is either assigned to the same equivalence class or not. I first present how (Vilain et al., 1995) compute precision and recall by comparing equivalence classes across a pair of annotations. Then I show how a revision of their approach can be converted to reliability measures, under certain important constraints.

The first annotation for Sample 2 places five tokens into one equivalence class referring to the engine ({A, B, D, G, I}), and three tokens into a class referring to the boxcar ({C, E, F}). This contrasts with the alternate annotation, where the same eight tokens are in two equivalence classes, but where D is placed with C: {A, B, G, I}, {C, D, E, F}. To apply recall and precision, we must assume that one of the annotations is correct. In general, a recall error involves failure to identify members of a target set; a precision error involves inclusion of additional elements besides those in the target set. Vilain et al. (1995) observe that intuitively, a comparison of two sets {A, B, D, G, I} from CA1 and {A, B, G, I} from CA2, where the first set is the target, involves only a recall error. The CA2 set does not include any additional elements, but it fails to include D. In contrast, the comparison of {C, E, F} as the target with {C, D, E, F} involves a precision error and no recall errors. In practice, the method given in (Vilain et al., 1995) does not compare elements of corresponding sets, but compares how many links are needed to connect the elements within corresponding sets.



To compute recall, (Vilain et al., 1995) start by creating a partition of a given target set from the corresponding response sets. This addresses the question of how many equivalence classes in the response set must be examined in order to reconstruct the target set. The relevant partition of {A, B, D, G, I} is thus into the two sets {A, B, G, I}, {D}. If the target set is conceived of as five nodes in a spanning tree (e.g., A-B-D-G-I), then the target "tree" can be constructed from the response by adding one link: a link from D to any node A, B, G or I. In general, the missing information for recall is quantified in terms of the number of links missing from the response partition. The number of links in a target equivalence class C is the cardinality of that class less 1: |C| - 1. The number of links missing from the partition of C relative to the response (p(C)) is the cardinality of the partition less 1: |p(C)| - 1. The recall for a given equivalence class is thus the ratio of the target links less the missing links to the target links:

\[ \mathrm{Recall}_C = \frac{(|C| - 1) - (|p(C)| - 1)}{|C| - 1} = \frac{|C| - |p(C)|}{|C| - 1} \tag{10} \]



When an equivalence class C_i in the target has an exact correspondence to one in the response, the cardinality of the partition p(C_i) is 1, the numerator and denominator in (10) are the same, and recall is perfect. Recall for a complete annotation is expressed in terms of all the equivalence classes C_i in the target annotation, by summing the per-class numerators of (10) and summing the target links (denominator):

\[ \mathrm{Recall} = \frac{\sum_i (|C_i| - |p(C_i)|)}{\sum_i (|C_i| - 1)} \tag{11} \]
\[ \mathrm{Recall}_{CA_1,CA_2} = \frac{(5-2)+(3-1)+(2-1)}{(5-1)+(3-1)+(2-1)} = \frac{6}{7} = .86 \tag{12} \]

Taking CA1 as the target, formula (11) gives a recall for CA2 of .86, as shown in (12).

Computation of precision in (Vilain et al., 1995) is the converse of the computation of recall. To illustrate, precision will be computed for the target set {C, E, F}. Precision is imperfect because the response set has an additional member: {C, D, E, F}. Where the response set is R, a partition of the response set relative to the target sets (p(R)) gives the two sets {C, E, F} and {D}. Precision of the target set C is then the ratio of the difference between the cardinality of the corresponding response set R and the cardinality of its partition p(R) to the cardinality of the response set R less 1:

\[ \mathrm{Precision}_C = \frac{|R| - |p(R)|}{|R| - 1} \tag{13} \]

M: okay we need to ship a boxcar of oranges to Bath by 8 AM today S: okay M: umm okay so I guess uh I would suggest that we use [engine E1]_1 uh and have [it]_1 pick up [a boxcar]_2 at ah Dansville how long'll [it]_3 take S: uh [that]_3'll take 3 hours to get to Dansville and get [the boxcar]_2 M: uh okay and then how long to go on to .. Corning with [the boxcar]_2 coupled to uh [E1]_1 S: another hour M: ok so that's okay and then uh if we loaded [the oranges]_4 at ah Corning and sent ah [E1]_1 on to Bath with [the oranges]_4 S: we'd get there at 7

Token   String        CA1   CA3
A       engine E1     1     1
B       it            1     1
C       a boxcar      2     2
D       it            1     3
D'      that          4     3
E       the boxcar    2     2
F       the boxcar    2     2
G       E1            1     1
H       the oranges   3     4
I       E1            1     1
J       the oranges   3     4

CA1 equivalence classes: {A, B, D, G, I}, {C, E, F}, {H, J}
CA3 equivalence classes: {A, B, G, I}, {C, E, F}, {D, D'}, {H, J}

Figure 2: Alternate co-reference annotation of sample 2



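Precision for the equivalence class {C, E, F} is then (4-2)/(4-1), or .67. Precision for a complete annotation sums over all the equivalence classes R_i in the response, as in (14). Precision of the entire coding CA2 relative to CA1 is .86, as shown in (15).

\[ \mathrm{Precision} = \frac{\sum_i (|R_i| - |p(R_i)|)}{\sum_i (|R_i| - 1)} \tag{14} \]
\[ \mathrm{Precision}_{CA_1,CA_2} = \frac{(4-1)+(4-2)+(2-1)}{(4-1)+(4-1)+(2-1)} = \frac{6}{7} = .86 \tag{15} \]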
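The link-based recall and precision described above can be sketched in a few lines. This is an illustration of the computation as the text describes it, not the scoring software of Vilain et al.; the function names `partition`, `link_score`, `recall` and `precision` are hypothetical, and each coding is represented as a list of sets of token labels.

```python
def partition(cls, other_coding):
    """Split one equivalence class by the classes of the other coding.

    Tokens that the other coding does not mention end up as singletons.
    """
    parts = []
    remaining = set(cls)
    for other in other_coding:
        common = remaining & set(other)
        if common:
            parts.append(common)
            remaining -= common
    parts.extend({tok} for tok in remaining)
    return parts


def link_score(key, other):
    """Sum of (|C| - |p(C)|) over classes C of `key`, divided by sum of (|C| - 1)."""
    found = sum(len(c) - len(partition(c, other)) for c in key)
    possible = sum(len(c) - 1 for c in key)
    return found / possible  # undefined if `key` has only singleton classes


def recall(target, response):
    # Partition each target class by the response classes.
    return link_score(target, response)


def precision(target, response):
    # The converse: partition each response class by the target classes.
    return link_score(response, target)


CA1 = [set("ABDGI"), set("CEF"), set("HJ")]
CA2 = [set("ABGI"), set("CDEF"), set("HJ")]

print(round(recall(CA1, CA2), 2), round(precision(CA1, CA2), 2))
```

For the single response class {C, D, E, F}, `link_score([set("CDEF")], CA1)` gives 2/3, which is the per-class precision computed from (13).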

2.1 Problems 

A perhaps more realistic alternate coding for sample 2 is shown in Figure 2. The token identified in Figures 1-2 as D was coded as coreferential with the expression engine E1 (token A) in annotation CA1. In annotation CA3 shown in Figure 2, this token is interpreted to refer to the process of getting engine E1 to pick up a boxcar at Dansville, and is annotated as coreferential with a token of the demonstrative pronoun "that", shown here as token D'. D' was not originally included in CA1, but is given here an arbitrary index of 4 in coding CA1 to indicate lack of coreference with any other expression. I will use a comparison of codings CA1 and CA3 to illustrate how the approach taken in (Vilain et al., 1995) presents certain problems for computing reliability, and for evaluating the type of annotation employed in (Passonneau and Litman, 1997).

Both of the problems discussed here pertain to the manner in which recall and precision are applied to data, rather than to the actual computation of recall and precision. The first problem is that (Vilain et al., 1995) do not constrain the sets of referring expressions that are being compared to have the same cardinality. The second is that they apply their method only to referring expressions that corefer with at least one other expression. My proposed solution requires that two annotations have the same cardinality of referring expressions. It also permits an annotator to interpret an expression as having no coreferential expressions, as in D' for coding CA1 (Figure 2). As I show below, these two moves make it possible to retain the basic insight from (Vilain et al., 1995), to compute reliability, and to apply the method to a broader range of annotation approaches, including the annotation style presented in (Passonneau, 1997).

The fundamental problem in comparing codings CA1 and CA3 is that the two data sets are incommensurate. Coding CA1 originally placed ten expressions into equivalence classes, while coding CA3 does so for eleven expressions. This prevents creation of a contingency table, and is thus an obstacle to applying reliability measures (cf. section 1).

The approach in (Vilain et al., 1995) does not require two codings to be commensurate in part because the annotators' task, as described in (Hirschman, 1996), has two parts: to identify the expressions to be coded, or markables, and to place markables into equivalence classes based on the coreference relation. As I argue in (Passonneau, 1997), there are several disadvantages to this approach. Identifying markables is a conceptually distinct task, can be partly automated with easily accessible and relatively simple tools, such as part-of-speech taggers, and is a language specific task. In contrast, coreference is difficult to automate (particularly in a sufficiently general way to apply across corpora), and is language independent. I take the evaluation of how markables are identified to be a separate problem. My goal is then to evaluate the inter-rater reliability of co-reference annotations, assuming that each rater is given the same set of markables to annotate.

Another serious drawback, of particular concern to investigators in the natural language generation community, is that the approach taken in (Vilain et al., 1995) fails to identify referential expressions comprising a singleton equivalence class. Instead, such expressions are omitted from consideration. However, it is of as much concern to determine the conditions under which a referent is mentioned only once, as to determine those under which it is re-mentioned. If two coders place the same expression in a class by itself, indicating lack of any coreferential expressions, note that recall and precision will both be zero. While at first this may seem counter-intuitive, it is entirely reasonable. First, what is being evaluated is the ability of distinct coders to find the same coreference links. In the case of comparing a singleton set to an identical singleton set, there are no coreference links to find. But note that no mismatching links have been identified.





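                         Coding CA1
                      +Link    -Link
Coding CA2   +Link      a        b        a+b
             -Link      c        d        c+d
                       a+c      b+d       a+b+c+d

\[ \mathrm{Recall} = \frac{a}{a+c} = \frac{\sum_i (|C_i| - |p(C_i)|)}{\sum_i (|C_i| - 1)} \tag{16} \]
\[ \mathrm{Precision} = \frac{a}{a+b} = \frac{\sum_i (|R_i| - |p(R_i)|)}{\sum_i (|R_i| - 1)} \tag{17} \]

Figure 3: Schematic representation of a 2-by-2 coincidence matrix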

Consider the result of imposing the requirement that two coreference codings must partition the same set of expressions into equivalence classes of coreference. If we assume that coding CA1 represents an annotator's judgement that token D' is in a singleton set, then we can create a contingency table of the two codings. The table total represents the total number of possible coreference links. In the case of codings CA1 and CA3, the table total is the cardinality of the set of tokens less 1, which is ten. To compute reliability, we need the four quantities a through d given in each cell of the table shown in Figure 3 (cf. Table 1). Of all possible coreference links, some will be identified by both coders. This is quantity a in Figure 3. Some will be identified by neither coder: quantity d in Figure 3. Thus a and d represent the two types of agreement between coders: agreement on coreference links, and agreement on their absence. In contrast, quantities b and c represent disagreements: the first coder finds links that the second coder does not, or vice versa.

Recall and precision are defined as illustrated in (16) and (17) of Figure 3 (Rijsbergen, 1979). Recall represents the proportion of links in the target that are also found in some test set, hence is the ratio of a to a + c. By setting this ratio equal to (11), the ratio proposed in (Vilain et al., 1995), we can begin to identify the individual quantities a through d. Precision represents the proportion of links found in some test set that are also in the target, hence is the ratio of a to a + b. As shown in (17), this ratio can be equated to (14). Given the table total and the two equalities (16) and (17), the four quantities a through d can be computed.

Recall that quantity a is the coreference links agreed on by CA1 and CA3. By (16) and (17), it is the sum of the differences of the cardinality of each equivalence class in CA1 less the cardinality of its partition by CA3. Equivalently, a is the sum of the differences of the cardinality of each equivalence class in CA3 less the cardinality of its partition by corresponding equivalence classes in CA1:

a = (5-2) + (3-1) + (1-1) + (2-1)
a = (4-1) + (3-1) + (2-2) + (2-1)
a = 6

Cell value b represents the coreference links identified in CA3 but not in CA1. It is the sum of the number of links for each equivalence class in CA3 (|C_i| - 1) less the coreference links found by both:

b = ((4-1) + (3-1) + (2-1) + (2-1)) - 6
b = 1

Conversely, cell value c represents the coreference links identified in CA1 but not in CA3. It is the sum of the number of links for each equivalence class in CA1 less the coreference links found by both:

c = ((5-1) + (3-1) + (1-1) + (2-1)) - 6
c = 1

It remains to calculate d, the possible links that neither coder identifies. We know the total possible coreference links: a+b+c+d = 10. And we know the values of a, b and c (a=6; b=c=1), thus d = 2. Another way to compute a and d is to compute the full partition of the equivalence classes in both codings (p(CA)), giving all links found in both codings:

p(CA) = {A, B, G, I}, {C, E, F}, {D}, {D'}, {H, J}

Note that the value of a (links agreed on by both coders) is the sum of the differences of the cardinality of each set in the partition p(CA) less 1:

a = (4-1) + (3-1) + (1-1) + (1-1) + (2-1)
a = 6

Then take the intersection of either CA1 or CA3 with p(CA). The value of d is the cardinality of either intersection less 1:

CA1 ∩ p(CA) = {C, E, F}, {D'}, {H, J}
CA3 ∩ p(CA) = {A, B, G, I}, {C, E, F}, {H, J}
d = |CA1 ∩ p(CA)| - 1
d = |CA3 ∩ p(CA)| - 1
d = 2

The contingency table for comparing CA1 and CA3 using the cell values we have just computed is given in Table 2.

Table 2: Coincidence matrix for CA1 by CA3
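The four cell values can also be derived mechanically from the two codings. The sketch below follows the steps in the text under the stated constraint that both codings partition the same set of tokens; the function names (`links`, `shared_links`, `link_table`, `kappa_2x2`) are illustrative only.

```python
def links(coding):
    """Number of coreference links in a coding: sum of (|C| - 1)."""
    return sum(len(c) - 1 for c in coding)


def shared_links(coding_x, coding_y):
    """Links in the full partition p(CA): intersect every pair of classes."""
    refinement = [x & y for x in coding_x for y in coding_y if x & y]
    return sum(len(s) - 1 for s in refinement)


def link_table(target, response):
    """Return the cells (a, b, c, d) of the 2-by-2 link table."""
    tokens = set().union(*target)
    assert tokens == set().union(*response), "codings must cover the same tokens"
    total = len(tokens) - 1           # all possible links
    a = shared_links(target, response)
    b = links(response) - a           # links only in the response
    c = links(target) - a             # links only in the target
    d = total - a - b - c             # links found by neither coder
    return a, b, c, d


def kappa_2x2(a, b, c, d):
    n = a + b + c + d
    p_ao = (a + d) / n
    p_ae = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_ao - p_ae) / (1 - p_ae)


# D' is a singleton class in CA1, so both codings cover eleven tokens.
CA1 = [set("ABDGI"), set("CEF"), {"D'"}, set("HJ")]
CA3 = [set("ABGI"), set("CEF"), {"D", "D'"}, set("HJ")]

a, b, c, d = link_table(CA1, CA3)
print(a, b, c, d, round(kappa_2x2(a, b, c, d), 2))
```

The assertion on the token sets enforces the commensurability requirement: without it, the table total is not well defined.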

2.2 Conversion to Reliability 

Now that we see how to construct a contingency table for coreference annotation, it is straightforward to compute reliability. Given that recall and precision are both just over 85%, one might interpret the similarity of the coding as being moderately good. However, as shown in (20)-(21), reliability is poor: κ = (.8 - .58)/(1 - .58) = .52. The interpretation of the κ value of .52 is that reliability is about halfway between completely random behavior (κ = 0) and perfect reliability (near 1).^1

\[ p_{A_O} = .6 + .2 = .8 \tag{20} \]
\[ p_{A_E} = (.7 \times .7) + (.3 \times .3) = .58 \tag{21} \]

^1 A negative κ value represents positive unreliability, as opposed to random correspondence. See (Cohen, 1960) for a discussion of the upper and lower limits of κ assuming p_AE is derived from marginals of a coincidence matrix. See (Krippendorff, 1980) for other methods of computing p_AE, and for applying reliability to continuous variables, etc.

Table 3: Coincidence matrix for R4 by R2

Table 3 compares the κ reliability score with recall and precision for an actual coding of a spoken narrative from (Chafe, 1980). One coding represents the consensus coding of coreference arrived at by the two investigators in the study reported in (Passonneau and Litman, 1997). The other coding was performed by a student with no linguistics background but some training in coreference annotation. As illustrated, the recall and precision scores are both apparently good (90% or above), but the κ score is only .65. This demonstrates concretely that because recall and precision do not factor out chance agreement, they can be misleading. In contrast, as discussed in section 1, κ quantifies the proportion of agreements between two coders that are above chance. In Table 3, both coders agree on 166 out of 242 coreference links (upper left cell). Because of the relatively high value of this cell, both recall and precision will be high (cf. Figure 3). But in addition, because the proportion of coreference links is very high for both R4 (179/242) and R2 (185/242), the chance of agreement on coreference links (or their absence) is also relatively high. Factoring out this chance agreement results in poor reliability.
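The effect of the high marginals can be reproduced from the three counts quoted above. In this sketch the remaining cells are derived from those counts rather than read from Table 3, and treating R2 as the target (so that recall is 166/185) is an assumption made to match the recall and precision reported for this narrative.

```python
total = 242          # all possible coreference links
a = 166              # links found by both coders
links_r4 = 179       # links found by R4
links_r2 = 185       # links found by R2 (assumed here to be the target)

b = links_r4 - a     # links only in R4
c = links_r2 - a     # links only in R2
d = total - a - b - c

recall = a / (a + c)
precision = a / (a + b)

p_ao = (a + d) / total
# Chance agreement on links plus chance agreement on their absence.
p_ae = (links_r4 * links_r2 + (total - links_r4) * (total - links_r2)) / total ** 2
kappa = (p_ao - p_ae) / (1 - p_ae)

print(b, c, d)
print(round(recall, 2), round(precision, 2), round(kappa, 2))
```

Expected agreement comes out at roughly .63, so most of the .87 observed agreement is what chance alone would predict from the marginals.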

Table 4 compares the κ scores with recall and precision for the same coder's annotations of ten narratives from (Chafe, 1980) against the codings used in (Passonneau and Litman, 1997). Narrative one, with κ of .85 compared with recall and precision of .96, illustrates the general trend that the κ scores are good, but not as high as one might assume given the generally high recall and precision. The last line of the table gives the standard deviation (σ) for each metric. Note that the standard deviation of the reliability measures is over 3 times that for recall and precision. A log kept by the coder of questions that arose during annotation suggests that the variation in reliability reflects differences in the coherence of the narratives, and the types of referential phenomena that occur, rather than inconsistency in the coder's behavior. For example, in this log the coder reported greatest difficulty with narratives 9 (α=.75) and 12 (α=.74), and used the phrases "I am confused, I don't understand what he is talking about" to describe particular coding problems. In contrast, the coder described narrative 16 (α=.93) as "pretty easy to code."



Narr.          κ      Recall     Precision
1             .85     .96        .96
2             .65     .90        .93
3             .72     .93        .94
4             .89     .94        .98
5             .89     .95        .99
(illegible)   .80     .34 (?)    (illegible)
8             .84     .91        .96
9             .75     .88        .96
11            .79     .92        .95
12            .74     .90        .92
15            .80     .93        .93
16            .93     .97        .98
17            .86     .95        .96
18            .84     .93        .96
19            .85     .96        .93
σ             .07     .02        .02

Table 4: Comparing Inter-rater Reliability of Coreference Annotations with Recall and Precision

3 Summary

A 2-by-2 coincidence matrix can be used to compute information retrieval metrics, or to compute reliability. Building on this observation, I have shown how the method in Vilain et al. (1995) for computing recall and precision for coreference annotation can be used to construct a coincidence matrix, and therefore to compute reliability. Each type of metric has its own uses. If a target or correct annotation has been established, it may be appropriate to evaluate recall and precision of a new coding against the target. However, in developing new annotated corpora with no pre-existing answer key, so to speak, it is important to evaluate the reliability of individual coders and of the datasets they produce. The data presented in the preceding section (Tables 3-4) demonstrate that one should not infer from high recall and precision of one annotation against another that either annotation is reliable, in the sense of reliability discussed in (Cohen, 1960) and (Krippendorff, 1980). Reliability measures should be used to identify reliable annotators and annotations. By merging the best data from mutually reliable codings, a more correct coding can be derived for a new corpus. Reliability scores can be used to determine whether a coder is trainable (improvements over time), and when the training can be terminated (no further improvement).

Poor reliability can be an indicator of omis- 
sions or flaws in a coding scheme. In ad- 
dition, reliability metrics can help the re- 
searcher identify data that is consistently not 
agreed upon among multiple coders. This 
might occur within a single discourse for par- 
ticular kinds of coreference phenomena. Or 
it might occur for an entire discourse as com- 
pared with other discourses, e.g., if the dis- 
course in question is unclear, vague, or oth- 
erwise non-optimal for coreference interpre- 
tation. 

Acknowledgements 

Thanks to Pamela Jordan, Diane Litman, 
and Marilyn Walker for helpful comments on 
this report. 



References

[Chafe 1980] Wallace L. Chafe. 1980. The Pear Stories: Cognitive, Cultural and Linguistic Aspects of Narrative Production. Ablex Publishing Corporation, Norwood, NJ.

[Cohen 1960] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.

[Francis and Kucera 1982] W. Francis and H. Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston, MA.

[Gross et al. 1993] Derek Gross, James Allen, and David R. Traum. 1993. The Trains 91 dialogues. Technical Report 92-1, University of Rochester, Rochester, NY.

[Hirschman 1996] Lynette Hirschman. 1996. Coreference specification for MUC (MUCCS). Version 4.0, Oct. 29, 1996.

[Krippendorff 1980] Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, CA.

[Passonneau and Litman 1997] Rebecca J. Passonneau and Diane Litman. 1997. Discourse segmentation by human and automated means. Computational Linguistics, 23.1:103-139. Special Issue on Empirical Studies in Discourse Interpretation and Generation.

[Passonneau 1997] Rebecca J. Passonneau. 1997. Instructions for applying discourse reference annotation for multiple applications (DRAMA). Technical report, Columbia University.

[Rijsbergen 1979] C. J. Van Rijsbergen. 1979. Information Retrieval. Butterworths, London.

[Vilain et al. 1995] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference, pages 45-52, San Francisco. Morgan Kaufmann.